AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any bank other than AllLife Bank? (0: No, 1: Yes)

# Installing the libraries with the specified versions.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 uszipcode==1.0.1 sqlalchemy_mate==1.4.28.4 -q --user
Preparing metadata (setup.py) ... done
Building wheel for atomicwrites (setup.py) ... done
WARNING: The scripts f2py, f2py3 and f2py3.10 are installed in '/root/.local/bin' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 1.5.3 which is incompatible.
google-colab 1.0.0 requires pandas==2.1.4, but you have pandas 1.5.3 which is incompatible.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.4.54 which is incompatible.
mizani 0.11.4 requires pandas>=2.1.0, but you have pandas 1.5.3 which is incompatible.
pandas-stubs 2.1.4.231227 requires numpy>=1.26.0; python_version < "3.13", but you have numpy 1.25.2 which is incompatible.
plotnine 0.13.6 requires pandas<3.0.0,>=2.1.0, but you have pandas 1.5.3 which is incompatible.
xarray 2024.9.0 requires pandas>=2.1, but you have pandas 1.5.3 which is incompatible.
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see warnings regarding package dependencies. These messages can be safely ignored, as the code above ensures that all necessary libraries and their dependencies are in place to successfully execute the code in this notebook.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# libraries necessary for model building
# import libraries to split data into training and test sets
from sklearn.model_selection import train_test_split
# import libraries to build decision tree models
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to compute tree classification metrics
from sklearn.metrics import (
confusion_matrix,
accuracy_score,
recall_score,
precision_score,
f1_score,
)
# import libraries to tune different models
from sklearn.model_selection import GridSearchCV
# import libraries to perform data preprocessing
from sklearn.preprocessing import StandardScaler
# import libraries to interpret Zipcode values
from uszipcode import SearchEngine
# to suppress unnecessary warnings
# import warnings
# warnings.filterwarnings('ignore')
/root/.local/lib/python3.10/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
# run the following lines for Google Colab authorization via a dialog window
from google.colab import drive
drive.mount('/content/drive')
# loading the dataset via the file and its directory
folder_path = '/content/drive/MyDrive/AI-ML Post Graduate/Data Documents/'
file_name = 'Loan_Modelling.csv'
dataset = pd.read_csv(folder_path+file_name)
Mounted at /content/drive
In order to avoid altering the original data, and to manipulate it without danger of data loss, we are going to make a copy of the dataset to use throughout this notebook.
# making a copy of the dataset
data = dataset.copy()
First, we shall observe the content of our dataset; this is also called a Data Overview.
# observing the first and last 5 rows of the dataset
data
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
5000 rows × 14 columns
# to obtain the number of rows and columns of the dataset
rows, columns = data.shape
print(f'The dataset has {rows} rows and {columns} columns.')
The dataset has 5000 rows and 14 columns.
# let us visualize the datatype of the columns
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
Observations:
- Except for CCAvg, all the columns hold whole numbers, as in type integer.
- CCAvg is the only continuous datatype in the dataset.

# we shall visualize the statistical summary of the columns
data.describe(include="all").T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
Observations:
Note: There are more observations and inferences that can be done, however since this is simply an Overview, further inquiry will be done in the Exploratory Data Analysis.
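One anomaly worth flagging from the summary above is the minimum Experience of -3, which cannot be a valid number of years. As a minimal sketch of how such entries could be counted, the check below is shown on a small hypothetical frame rather than the actual dataset:

```python
import pandas as pd

def count_negative(df: pd.DataFrame, column: str) -> int:
    """Count rows where the given column holds a negative value."""
    return int((df[column] < 0).sum())

# small hypothetical frame standing in for the real data
sample = pd.DataFrame({"Experience": [1, 19, -3, 9, -1]})
print(count_negative(sample, "Experience"))  # number of invalid entries
```

Running the same check on the real Experience column would tell us how many rows need correction or removal during Data Preprocessing.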
Before investigating this data further, we have to run some sanity checks.
# visualizing missing values by the sum of NaN values
# since all the columns are numerical types
data.isna().sum()
| 0 | |
|---|---|
| ID | 0 |
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIPCode | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal_Loan | 0 |
| Securities_Account | 0 |
| CD_Account | 0 |
| Online | 0 |
| CreditCard | 0 |
# Let's observe if there are duplicated rows
dups = data.duplicated().sum()
print(f"There are {dups} duplicated rows")
There are 0 duplicated rows
Questions:
Let us first observe the distribution of the columns.
# observing the distribution of the columns via a histogram
# setting the size of the overall figure
plt.figure(figsize=(15, 10))
# making a list of all the data columns
# excluding ID and ZIPCode since their numerical value
# does not provide any meaningful insight
data_col = data.drop(columns=["ID", "ZIPCode"]).columns.to_list()
# plotting the histogram for each attribute of the dataset
# and appending it to the overall figure
for i, attribute in enumerate(data_col):
plt.subplot(3,4, i+1)
sns.histplot(data = data, x = attribute)
# showing the figure tightly
plt.tight_layout()
plt.show()
Observations:
- Income and CCAvg hold a right-skewed distribution, with most people earning less than 100k dollars and spending less than ~3k dollars.
- Age and Experience hold a multimodal distribution: Age peaks at approximately 30, 39, 51, and 59 years, while Experience peaks at approximately 5, 11, 20, 29, and 35 years of professional experience.

# observing the distribution of the columns via a boxplot
# setting the size of the overall figure
plt.figure(figsize=(15, 10))
# making a list of all the data columns
# excluding ID, ZIPCode and the binary columns (those that hold either 0 or 1)
# since their numerical value does not provide any meaningful insight
data_col = data.drop(columns=["ID", "ZIPCode", "Personal_Loan", "Securities_Account", "CD_Account", "Online", "CreditCard"]).columns.to_list()
# plotting the boxplot
# for each attribute of the dataset from the list
# and appending it to the overall figure
for i, attribute in enumerate(data_col):
plt.subplot(3,3, i+1)
sns.boxplot(data = data, x = attribute)
# showing the figure tightly
plt.tight_layout()
plt.show()
Observations:
- Income shows 75% of customers earning below 100k dollars. Similarly, CCAvg shows 75% of customers spending below 3k dollars on credit cards. However, outliers start to appear in both of these plots, primarily in credit card spending.
- These outliers can represent either erroneous values or simply a small share of values relative to the overall data. Their treatment will be done in Data Preprocessing.
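The outliers visible in the boxplots can also be quantified. Below is a minimal sketch using the standard 1.5×IQR rule; the toy values are illustrative and do not come from the dataset:

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# illustrative values standing in for a column such as CCAvg
spending = pd.Series([0.5, 0.7, 1.0, 1.5, 1.8, 2.2, 2.5, 9.8])
print(iqr_outlier_count(spending))
```

Applying this function to Income and CCAvg would give the exact counts behind the visual impression above.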
Before moving on to multivariate analysis, we shall answer one of the given questions.
We can derive the answer to this question by excluding those who do not spend money on credit cards.
# Extracting a smaller dataframe that
# holds those with average credit card spending of 0.
# From that dataframe we extract the number of rows, which
# represents the number of such customers.
customers = data[data["CCAvg"] == 0].shape[0]
# we subtract this number from the overall customer amount
customers = data.shape[0] - customers
# Printing the number of customers with credit cards
print(f"There are {customers} customers with credit cards.")
There are 4894 customers with credit cards.
During our Exploratory Data Analysis, we ignored the value of ZIPCode, as its numerical value did not provide any meaningful insights. However, we will now inquire further into these values to acquire relevant information.
This will be done via the uszipcode library, which provides information on US zipcodes. Since our bank is based in the US, we shall assume that most, if not all, customers in our dataset hold US zipcodes rather than foreign values.
# Let's see first how many unique zipcodes are in the data
unique_zipcodes = data["ZIPCode"].unique().shape[0]
print(f"There are {unique_zipcodes} unique zipcodes.")
There are 467 unique zipcodes.
Now we will work with uszipcode, specifically its SearchEngine() class, which allows us to obtain information about a zipcode from its numerical value.
# Creating the instance of SearchEngine()
zip_search = SearchEngine()
Download /root/.uszipcode/simple_db.sqlite from https://github.com/MacHu-GWU/uszipcode-project/releases/download/1.0.1.db/simple_db.sqlite ... Complete!
Let us observe the range of values given to a single zipcode.
# Using the method of the instance to search the zipcode
# and giving the value in the dictionary type
zip_search.by_zipcode(94116).to_dict()
{'zipcode': '94116',
'zipcode_type': 'STANDARD',
'major_city': 'San Francisco',
'post_office_city': 'San Francisco, CA',
'common_city_list': ['San Francisco'],
'county': 'San Francisco County',
'state': 'CA',
'lat': 37.74,
'lng': -122.48,
'timezone': 'America/Los_Angeles',
'radius_in_miles': 2.0,
'area_code_list': '415',
'population': 43698,
'population_density': 16901.0,
'land_area_in_sqmi': 2.59,
'water_area_in_sqmi': 0.04,
'housing_units': 16283,
'occupied_housing_units': 15445,
'median_home_value': 734400,
'median_household_income': 83407,
'bounds_west': -122.510407,
'bounds_east': -122.458635,
'bounds_north': 37.764001,
'bounds_south': 37.733771}
From all the values given, we will only use some. Additionally, since some of these values are missing for certain zipcodes, we will use the values that can best represent each zipcode. These values are the following: Major_City, State, County, Latitude, and Longitude.
During our analysis, if any of these values do not hold any meaningful insights into our objective, we will drop these attributes from the analysis moving forward. To ease our analysis, we will make a dataset from all of the unique and valid zipcodes.
# We will create an empty dataframe for the zipcodes
# and a temporary list to hold the values
zipcodes = pd.DataFrame()
temp_list = []
# Cycling through all the unique zipcodes in the dataset
for zc in data["ZIPCode"].unique():
# the try section will execute the commands inside
# only if nothing causes an exception
try:
# We search for the specific zipcode
zip_info = zip_search.by_zipcode(zc)
# creating a dictionary filled with the zipcode's values
zip_dict = {
"ZIPCode": zc,
"Major_City": zip_info.major_city,
"State": zip_info.state,
"County": zip_info.county,
"Latitude": zip_info.lat,
"Longitude": zip_info.lng
}
# the dictionary will be appended to the temporary list
temp_list.append(zip_dict)
# If an exception occurred (such as an invalid zipcode),
# the except block records the searched zipcode with
# empty values in the dict, which is then appended to the list
except Exception:
zip_dict = {
"ZIPCode": zc,
"Major_City": None,
"State": None,
"County": None,
"Latitude": np.NaN,
"Longitude": np.NaN
}
temp_list.append(zip_dict)
# When all the zipcodes are searched for,
# the temporary list is incorporated into the dataframe created.
zipcodes = pd.DataFrame(temp_list)
# printing the dataframe to validate the operation
zipcodes
| ZIPCode | Major_City | State | County | Latitude | Longitude | |
|---|---|---|---|---|---|---|
| 0 | 91107 | Pasadena | CA | Los Angeles County | 34.16 | -118.08 |
| 1 | 90089 | Los Angeles | CA | Los Angeles County | 34.02 | -118.29 |
| 2 | 94720 | Berkeley | CA | Alameda County | 37.87 | -122.25 |
| 3 | 94112 | San Francisco | CA | San Francisco County | 37.72 | -122.44 |
| 4 | 91330 | Northridge | CA | Los Angeles County | 34.25 | -118.53 |
| ... | ... | ... | ... | ... | ... | ... |
| 462 | 90068 | Los Angeles | CA | Los Angeles County | 34.13 | -118.33 |
| 463 | 94970 | Stinson Beach | CA | Marin County | 37.91 | -122.65 |
| 464 | 90813 | Long Beach | CA | Los Angeles County | 33.78 | -118.18 |
| 465 | 94404 | San Mateo | CA | San Mateo County | 37.55 | -122.26 |
| 466 | 94598 | Walnut Creek | CA | Contra Costa County | 37.91 | -122.01 |
467 rows × 6 columns
With our new dataset of zipcodes, let's run a simple overview before analysing its values.
zipcodes.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ZIPCode | 467.0 | NaN | NaN | NaN | 93077.233405 | 1806.56932 | 90005.0 | 91743.0 | 93022.0 | 94605.0 | 96651.0 |
| Major_City | 463 | 244 | Los Angeles | 35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| State | 463 | 1 | CA | 463 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| County | 463 | 38 | Los Angeles County | 116 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Latitude | 463.0 | NaN | NaN | NaN | 35.671102 | 2.14791 | 32.55 | 33.94 | 34.39 | 37.74 | 41.76 |
| Longitude | 463.0 | NaN | NaN | NaN | -119.816911 | 2.070952 | -124.11 | -122.02 | -118.95 | -118.005 | -115.65 |
Observations:
A simple glance tells us that the majority of the customers in this sample hail from Los Angeles, CA. However, we must delve deeper to determine whether these values are relevant to our prediction procedures.
In order to get relevant insights from these values, we must append them to the original dataframe.
# Before we append, let's see its metadata
zipcodes.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 467 entries, 0 to 466 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ZIPCode 467 non-null int64 1 Major_City 463 non-null object 2 State 463 non-null object 3 County 463 non-null object 4 Latitude 463 non-null float64 5 Longitude 463 non-null float64 dtypes: float64(2), int64(1), object(3) memory usage: 22.0+ KB
We can observe that there are missing values in the data in the form of null objects. Our decision on the matter will be reflected and realized in Data Preprocessing.
Now, we shall combine the two datasets into one using pandas' merge() function. Since both datasets use the same values for ZIPCode, merging them is straightforward.
# Merging the two datasets with "ZIPCode" as their anchor.
# In order to avoid mismanipulation of our dataset, let us make a copy
data_w_zip = pd.merge(data, zipcodes, on="ZIPCode", copy=True)
# returning the dataset to validate the procedure.
data_w_zip
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Major_City | State | County | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Pasadena | CA | Los Angeles County | 34.16 | -118.08 |
| 1 | 456 | 30 | 4 | 60 | 91107 | 4 | 2.2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Pasadena | CA | Los Angeles County | 34.16 | -118.08 |
| 2 | 460 | 35 | 10 | 200 | 91107 | 2 | 3.0 | 1 | 458 | 0 | 0 | 0 | 0 | 0 | Pasadena | CA | Los Angeles County | 34.16 | -118.08 |
| 3 | 576 | 54 | 30 | 93 | 91107 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Pasadena | CA | Los Angeles County | 34.16 | -118.08 |
| 4 | 955 | 37 | 12 | 169 | 91107 | 2 | 5.2 | 3 | 249 | 1 | 0 | 0 | 1 | 0 | Pasadena | CA | Los Angeles County | 34.16 | -118.08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4081 | 27 | 0 | 40 | 90068 | 1 | 2.0 | 2 | 110 | 0 | 0 | 0 | 0 | 1 | Los Angeles | CA | Los Angeles County | 34.13 | -118.33 |
| 4996 | 4347 | 45 | 21 | 33 | 94970 | 3 | 0.5 | 1 | 136 | 0 | 0 | 1 | 1 | 1 | Stinson Beach | CA | Marin County | 37.91 | -122.65 |
| 4997 | 4624 | 50 | 25 | 45 | 90813 | 2 | 0.6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Long Beach | CA | Los Angeles County | 33.78 | -118.18 |
| 4998 | 4802 | 34 | 10 | 88 | 94404 | 2 | 0.0 | 1 | 121 | 0 | 0 | 0 | 1 | 0 | San Mateo | CA | San Mateo County | 37.55 | -122.26 |
| 4999 | 4868 | 38 | 12 | 61 | 94598 | 4 | 0.2 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Walnut Creek | CA | Contra Costa County | 37.91 | -122.01 |
5000 rows × 19 columns
Now that we have a dataset with zipcode values, we can employ this to reflect insights on our customer samples.
Note: During the analysis, we will use both data and data_w_zip, using the latter only to observe interactions with the zipcode-derived attributes, in order to reduce unnecessary processing and keep the plots simple.
# Let us see which county is the most popular in our sample
# setting the size
plt.figure(figsize=(5,10))
sns.countplot(
# selecting our combined dataset
data=data_w_zip,
# to provide better legibility, we shall plot on the Y axis instead of X
y="County",
# scaling the data to a percentage
stat='percent',
# ordering the data for better visualization
order = data_w_zip["County"].value_counts(ascending=True).index
);
# showing the plot
plt.show()
Observations:
# Since there are 244 unique Major Cities,
# instead of a plot we will visualize with a table (Pandas Series to be exact.)
data_w_zip["Major_City"].value_counts(normalize=False).head(10)
| Major_City | |
|---|---|
| Los Angeles | 375 |
| San Diego | 269 |
| San Francisco | 257 |
| Berkeley | 241 |
| Sacramento | 148 |
| Palo Alto | 130 |
| Stanford | 127 |
| Davis | 121 |
| La Jolla | 112 |
| Santa Barbara | 103 |
Observations:
# Let us visualize the distribution of Latitude and Longitude
# to observe if there may be relevant information
# Setting the size
plt.figure(figsize=(10,5))
# creating a temporary list to cycle through with an incremental variable
temp_list = ["Latitude", "Longitude"]
i=1
# creating the loop for the attributes
for attribute in temp_list:
# draws the plot in the "i"th position.
plt.subplot(2,2,i)
sns.histplot(data=data_w_zip, x=attribute, kde=True)
# incrementing the position variable and drawing a second time
i+=1
plt.subplot(2,2,i)
sns.boxplot(data=data_w_zip, x=attribute)
i+=1
# shows the overall plot
plt.tight_layout()
plt.show()
Observations:
Few inferences can be drawn from these two attributes without further parameters. If they do not reveal relevant information, the case for removing them grows stronger.
In this section we shall analyse attributes with one another to derive insights from them.
# setting the size
plt.figure(figsize=(15,10))
# creating the heatmap to visualize the correlation between numerical values.
sns.heatmap(
# selecting only the numerical values that hold meaning
data_w_zip.drop(columns=["ID","ZIPCode"]).corr(numeric_only=True),
# displaying the correlation value
annot=True,
# variable for the colorpalette of the heatmap
cmap="coolwarm",
# minimum value
vmin=-1,
# maximum value
vmax=1,
# format to show the correlation
fmt='.2f'
);
# pairplot creates its own figure, so no plt.figure() call is needed here
# selecting only the numerical data
num_data = data_w_zip.drop(columns=["ID","ZIPCode", "Online", "CreditCard","Securities_Account", "CD_Account", "Family", "Major_City", "State", "County"])
# Drawing a pairplot of all the numerical data with Personal_Loan as a hue
sns.pairplot(num_data, hue="Personal_Loan");
Observations:
- With Personal_Loan, Income is positively correlated, with CCAvg and CD_Account holding lower but still meaningful positive correlation values. Education and Mortgage hold small but relevant positive correlations with it.
- Income and CCAvg hold a very high positive correlation. Additionally, both Family and Education hold a small negative correlation with both of these attributes.
- A small positive correlation exists between Personal_Loan and Family. However, Family also holds a small negative correlation with Income and CCAvg.
- CD_Account holds positive correlation, in descending order of magnitude, with Personal_Loan, Securities_Account, CreditCard, Online, Income, CCAvg and Mortgage.

After seeing the correlation heatmap and the pairplot, we shall probe further into those attributes that hold correlation, for better analysis.
Note: Some of this analysis can be seen in the pairplot; however, due to its large size, we shall generate simpler plots for analysis.
We will reference Personal Loan throughout this analysis; however, this specific section will emphasize the attributes that hold positive correlation with Personal Loan.
# Visualizing the relationship between CCAvg, Income and Personal_Loan
sns.scatterplot(data=data, x="Income", y="CCAvg", hue="Personal_Loan");
Observations:
# Visualizing the relationship between CD_Account, CCAvg and Personal_Loan
# in two graphs
plt.figure(figsize=(10,5))
# first plot
plt.subplot(1,2,1)
sns.boxplot(data=data, x="CD_Account", y="CCAvg", hue="Personal_Loan")
# second plot
plt.subplot(1,2,2)
sns.countplot(data=data, x="CD_Account", hue="Personal_Loan")
# showing the overall plot
plt.tight_layout()
plt.show()
Observations:
- The relationship between CCAvg and personal loans can be seen here as well: 50% of those who accepted the loan spend more than 3.5k dollars, in contrast to those who didn't, 75% of whom spend less than 3k dollars.
- The relationship between CD_Account and personal loans can be seen in the outliers. People who hold a certificate of deposit are fewer than those who do not.

We shall answer one of the questions proposed.
# visualizing the relationship between age and Personal_Loan
# in a subplot, as in two plots.
# Setting the size
plt.figure(figsize=(10,5))
# drawing the first plot, a histogram
plt.subplot(1,2,1)
sns.histplot(data=data, x="Age", hue="Personal_Loan");
# drawing the second plot, a box plot
plt.subplot(1,2,2)
sns.boxplot(data=data, x="Age", hue="Personal_Loan");
plt.show()
Observations:
We will continue our analysis in order of appearance in the heatmap.
# Visualizing the relationship between Family, CCAvg and Income
# in two graphs
plt.figure(figsize=(10,5))
# first plot
plt.subplot(1,2,1)
sns.boxplot(data=data, hue="Family", y="CCAvg", palette="Set2");
# second plot
plt.subplot(1,2,2)
sns.boxplot(data=data, hue="Family", y="Income", palette="Set2");
# showing the overall plot
plt.tight_layout()
plt.show()
Observations:
# Visualizing the relationship between Education, CCAvg and Income
# in two graphs
plt.figure(figsize=(10,5))
# first plot
plt.subplot(1,2,1)
sns.boxplot(data=data, hue="Education", y="CCAvg", palette="Set2");
# second plot
plt.subplot(1,2,2)
sns.boxplot(data=data, hue="Education", y="Income", palette="Set2");
# showing the overall plot
plt.tight_layout()
plt.show()
Observations:
- Similar to Family's correlation, Income and CCAvg values tend to decrease as Education advances from 1 to 2. Nonetheless, this correlation is small.

# Visualizing the correlation between Personal_Loan and Education
sns.countplot(data=data, x="Education", hue="Personal_Loan", stat="percent");
plt.show()
Observations:
- As Education increases, so does the proportion of those who accepted the loan.
- Conversely, the proportion of those who rejected the loan decreases as Education increases.

# Visualizing the relationship between Mortgage, Income and Personal_Loan
sns.scatterplot(data=data, x="Mortgage", y="Income", hue="Personal_Loan");
# Since the relationship of mortgage and personal loan is hard to visualize,
# we shall use a kde plot to simplify its density
sns.kdeplot(data=data, x="Mortgage", hue="Personal_Loan", common_norm=False);
Observations:
- Those who accepted the loan tend to hold higher values of Mortgage than those who have a mortgage and rejected the loan.

CD_Account holds positive correlation with other attributes we have yet to discuss. We shall analyse those attributes in this section.
# Visualizing CD_Account's relationship with the other attributes
# Setting the size
plt.figure(figsize=(12,7))
# Temporary list to hold the attributes
temp_list = ["Securities_Account", "Online", "CreditCard"]
# cycle through the attributes to draw its plot
for i, attribute in enumerate(temp_list):
plt.subplot(2,2,i+1)
sns.countplot(data=data, x=attribute, hue="CD_Account", stat="percent");
# showing the overall plot
plt.tight_layout()
plt.show()
Observations:
# Let us see how loan acceptance is distributed across counties
# setting the size
plt.figure(figsize=(10,10))
sns.countplot(
# selecting our combined dataset
data=data_w_zip,
# to provide better legibility, we shall plot on the Y axis instead of X
y="County",
# scaling the data to a percentage
stat='percent',
# ordering the data for better visualization
order = data_w_zip["County"].value_counts().index,
hue = "Personal_Loan"
);
# showing the plot
plt.show()
Observations:
This suggests that the zipcode-derived values may not heavily influence the model.
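The visual impression above can be checked numerically by computing the loan-acceptance rate per county. Below is a minimal sketch of that group-by, shown on a small hypothetical frame; the county names and rates are illustrative, not results from the dataset:

```python
import pandas as pd

def acceptance_rate_by_group(df: pd.DataFrame, group_col: str, target_col: str) -> pd.Series:
    """Mean of a 0/1 target per group, sorted in descending order."""
    return df.groupby(group_col)[target_col].mean().sort_values(ascending=False)

# hypothetical example rows standing in for data_w_zip
toy = pd.DataFrame({
    "County": ["A", "A", "A", "B", "B"],
    "Personal_Loan": [1, 0, 0, 1, 1],
})
print(acceptance_rate_by_group(toy, "County", "Personal_Loan"))
```

If the per-county rates in the real data stay close to the overall base rate of about 9.6% (the mean of Personal_Loan in the summary statistics), that supports dropping the location attributes.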
Previously, we looked at missing values while doing the data overview. However, after incorporating the values derived from the zipcodes, some zipcodes proved invalid. Let's review those.
# extracting null values, as in missing meaningful values
data_w_zip.isnull().sum()
| 0 | |
|---|---|
| ID | 0 |
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIPCode | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal_Loan | 0 |
| Securities_Account | 0 |
| CD_Account | 0 |
| Online | 0 |
| CreditCard | 0 |
| Major_City | 34 |
| State | 34 |
| County | 34 |
| Latitude | 34 |
| Longitude | 34 |
# Counting how many entries contain our dependent variable: Personal_Loan
# We search for those with NaN values in Latitude,
# select our desired column, "Personal_Loan"
# and count the values.
data_w_zip[data_w_zip["Latitude"].isna()]["Personal_Loan"].value_counts()
| Personal_Loan | |
|---|---|
| 0 | 31 |
| 1 | 3 |
From the 5000 entries, 34 contain an error due to the zipcode; only 3 of those accepted the loan. Since this is a small number of missing values, we are going to drop these rows.
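The impact of this drop can be quantified: the 3 positive cases among the 34 rows being removed can be compared against the roughly 480 loan acceptors in the full data (480 being 9.6% of 5000, from the summary statistics earlier). A quick sketch:

```python
# positives in the full data and among the rows being dropped,
# taken from the counts reported above
total_positives = int(0.096 * 5000)   # ~480 loan acceptors overall
dropped_positives = 3                 # acceptors among the 34 invalid-zipcode rows

share_lost = dropped_positives / total_positives
print(f"{share_lost:.2%} of positive cases lost by dropping these rows")
```

Losing well under 1% of the positive class is a small price for a clean dataset.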
# dropping those values
data_w_zip.dropna(inplace=True)
# verifying the procedure
data_w_zip.isnull().sum()
| 0 | |
|---|---|
| ID | 0 |
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIPCode | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal_Loan | 0 |
| Securities_Account | 0 |
| CD_Account | 0 |
| Online | 0 |
| CreditCard | 0 |
| Major_City | 0 |
| State | 0 |
| County | 0 |
| Latitude | 0 |
| Longitude | 0 |
Now, we will continue by removing columns that are meaningless for our model. The following attributes bear little or no value to the model:
- ID: The identifier of the customer. Since it carries no information about the customer, we shall drop this column.
- Age & Experience: By themselves, these do not appear to correlate with the other variables, making them seem irrelevant. We shall experiment with these during model building.
- ZIPCode: By itself, the numerical value of ZIPCode brings no further insight into our model.
- Latitude & Longitude: The location of the customer has little if any relevance to the loans in our analysis.
- State: Since all of the customers in our sample are from CA, this column is redundant. Furthermore, one could infer that the model we build applies solely to CA; it is best to avoid encoding that inference.
- Major_City & County: The abundance of unique values in these columns would introduce a lot of noise into our model. Additionally, since we only have CA data, we would risk overfitting to CA customers. We shall drop these attributes as well.

Note: These inferences are subjective to a small extent, so a different analyst might reach different conclusions.
# Let's remove those columns from the dataset.
# We shall make a new variable for this new dataset.
final_data = data_w_zip.drop(columns=["ID", "ZIPCode", "Latitude", "Longitude", "State", "Major_City", "County"])
# Verifying the procedure
final_data.head()
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 30 | 4 | 60 | 4 | 2.2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 35 | 10 | 200 | 2 | 3.0 | 1 | 458 | 0 | 0 | 0 | 0 | 0 |
| 3 | 54 | 30 | 93 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 37 | 12 | 169 | 2 | 5.2 | 3 | 249 | 1 | 0 | 0 | 1 | 0 |
Before we make our final preparations, we have to review the outliers and reflect on them.
# observing the distribution of the columns via a boxplot
# setting the size of the overall figure
plt.figure(figsize=(10, 5))
# making a list of all the data columns
# excluding the target and the categorical columns (binary 0/1 flags and Education levels)
# since their coded values do not form a meaningful continuous distribution
data_col = final_data.drop(columns=["Personal_Loan", "Education", "Securities_Account", "CD_Account", "Online", "CreditCard"]).columns.to_list()
# plotting the boxplot
# for each attribute of the dataset from the list
# and appending it to the overall figure
for i, attribute in enumerate(data_col):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=final_data, x=attribute)
# showing the figure tightly
plt.tight_layout()
plt.show()
# Let us focus the graphs on the outliers
# using the subplot "routine" we have been using
plt.figure(figsize=(15,5))
outliers = ["Income", "CCAvg", "Mortgage"]
for i, attribute in enumerate(outliers):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(data=final_data, x=attribute)
plt.tight_layout()
plt.show()
Among the continuous attributes, we detect outliers in three: Income, CCAvg, and Mortgage. Let's discuss them:
- Income & CCAvg: The outliers here result from the natural distribution of the data. There are two clusters to mention: in Income, around and below 200 thousand dollars; in CCAvg, around and below 9 thousand dollars. These outliers represent genuine information. However, both columns also have a small cluster above the main outlier cluster; let's investigate those values further.
- Mortgage: The outliers result from the most popular Mortgage value, 0, skewing the distribution. They represent actual information, therefore we will keep these values.

# inspecting the customers in the high-Income cluster
data[data["Income"] > 200].sort_values(by="Income", ascending=False)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3896 | 3897 | 48 | 24 | 224 | 93940 | 2 | 6.67 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4993 | 4994 | 45 | 21 | 218 | 91801 | 2 | 6.67 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 526 | 527 | 26 | 2 | 205 | 93106 | 1 | 6.33 | 1 | 271 | 0 | 0 | 0 | 0 | 1 |
| 2988 | 2989 | 46 | 21 | 205 | 95762 | 2 | 8.80 | 1 | 181 | 0 | 1 | 0 | 1 | 0 |
| 677 | 678 | 46 | 21 | 204 | 92780 | 2 | 2.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2278 | 2279 | 30 | 4 | 204 | 91107 | 2 | 4.50 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4225 | 4226 | 43 | 18 | 204 | 91902 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2101 | 2102 | 35 | 5 | 203 | 95032 | 1 | 10.00 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3804 | 3805 | 47 | 22 | 203 | 95842 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 787 | 788 | 45 | 15 | 202 | 91380 | 3 | 10.00 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3608 | 3609 | 59 | 35 | 202 | 94025 | 1 | 4.70 | 1 | 553 | 0 | 0 | 0 | 0 | 0 |
| 1711 | 1712 | 27 | 3 | 201 | 95819 | 1 | 6.33 | 1 | 158 | 0 | 0 | 0 | 1 | 0 |
| 1901 | 1902 | 43 | 19 | 201 | 94305 | 2 | 6.67 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2337 | 2338 | 43 | 16 | 201 | 95054 | 1 | 10.00 | 2 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2447 | 2448 | 44 | 19 | 201 | 95819 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4895 | 4896 | 45 | 20 | 201 | 92120 | 2 | 2.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data[data["CCAvg"] > 9].sort_values(by="CCAvg", ascending=False)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 787 | 788 | 45 | 15 | 202 | 91380 | 3 | 10.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2101 | 2102 | 35 | 5 | 203 | 95032 | 1 | 10.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2337 | 2338 | 43 | 16 | 201 | 95054 | 1 | 10.0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3943 | 3944 | 61 | 36 | 188 | 91360 | 1 | 9.3 | 2 | 0 | 1 | 0 | 0 | 0 | 0 |
Observations:
Customers with outlier CCAvg values also tend to have high income, which makes their expenditure amounts fairly reasonable. Likewise, the Income outliers largely overlap with the CCAvg outliers. Since we have no further domain knowledge to justify removing them, we shall keep these outliers as they are.
Now that the outliers have been addressed, we shall prepare the data for modeling.
We shall split the data into two variables. One that holds only our response variable (dependent variable, in this case Personal_Loan) and another that holds the rest of the explanatory variables (independent variables). Afterwards, we shall split the data into two sets. One that holds data to train the model and another that holds data to test the model.
Since we do not have any categorical attributes with non-numerical values, we can avoid creating what are known as "dummy variables".
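For reference, if the dataset did contain string-valued categories, pandas' get_dummies would handle the encoding; a minimal sketch with a hypothetical Region column (not part of our data):

```python
import pandas as pd

# Hypothetical string-valued categorical column (not in our dataset)
demo = pd.DataFrame({"Region": ["North", "South", "North", "West"]})

# One-hot encode, dropping the first level to avoid a redundant column
dummies = pd.get_dummies(demo, columns=["Region"], drop_first=True)
print(dummies.columns.tolist())  # ['Region_South', 'Region_West']
```

drop_first removes the column that is implied by the others, which keeps the encoded features linearly independent.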
# Defining the explanatory and response variables
X = final_data.drop(columns="Personal_Loan")
y = final_data["Personal_Loan"]
As we split the data, we will use a specified random state. The split is still random; fixing the random state simply ensures the same split each time, for reproducible results.
# Assigning our random state
RS = 69
# Splitting the data to train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
# this is the percentage of the data that will go to the test set
test_size=0.2,
# The random state as defined previously
random_state=RS,
# stratify will split the response variable proportionally between training and test,
# preserving the class ratio between the different sets.
stratify = y
)
Let us visualize the split:
# Printing the shape of the train and test independent variables
# as well as the division of our desired variable.
print(f"Shape of the Training Set: {X_train.shape}")
print(f"Shape of the Test Set: {X_test.shape}")
print("Percentage of response variables (classes) in training set:")
print(y_train.value_counts(normalize=True) * 100)
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True) * 100)
Shape of the Training Set: (3972, 11)
Shape of the Test Set: (994, 11)
Percentage of response variables (classes) in training set:
0    90.382679
1     9.617321
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    90.442656
1     9.557344
Name: Personal_Loan, dtype: float64
Our objective in this project is to predict whether a liability customer will buy personal loans, to understand which attributes contribute to it, and to identify segments of customers to target more. The main model we will build is a decision tree classifier. Based on certain attributes, this algorithm estimates the probability of a customer buying the loan or not. However, we must establish criteria to evaluate the model.
The model can make the following predictions:
- True Positive: the customer accepted the loan and the model predicted acceptance.
- True Negative: the customer rejected the loan and the model predicted rejection.
- False Positive: the customer rejected the loan but the model predicted acceptance.
- False Negative: the customer accepted the loan but the model predicted rejection.

These will be visualized in a table called the "Confusion Matrix", which we will see later on.
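As a small toy illustration (not our dataset), sklearn's confusion_matrix can be unpacked directly into these four counts:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = accepted the loan, 0 = did not
actual    = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1]

# For binary labels, ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```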
Let us discuss the evaluation metrics for our decision tree. There are 4 commonly used metrics for classification models:
- Accuracy: the proportion of all predictions that are correct.
- Recall: the proportion of actual positives that are correctly identified.
- Precision: the proportion of predicted positives that are actually positive.
- F1 Score: the harmonic mean of Precision and Recall.
Given their algebraic composition, Precision aims to reduce False Positives while Recall aims to reduce False Negatives. In this specific case, however, neither is more important than the other. Therefore we will aim to increase the F1 Score without prioritizing either of the other two metrics.
Note: This is not to suggest that Precision and Recall are unimportant; only that we will not strive to maximize one at the expense of the other.
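To make this concrete, the F1 Score is the harmonic mean of Precision and Recall, so it is only high when both are reasonably high; a quick arithmetic check with toy values:

```python
# F1 is the harmonic mean of precision and recall:
# a weak value in either one drags the score down.
precision = 0.9
recall = 0.6

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.72
```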
With this in mind, we will create a function to display the metrics of a given decision tree, as well as another function to display the confusion matrix.
# Defining a function to compute 4 different metrics
# in order to monitor performance of a decision tree
def model_performance_metrics(model, predictors, target):
    """
    Function to compute 4 different metrics to evaluate classification model performance
    model: decision tree
    predictors: independent variables
    target: dependent variable
    """
    # predicting with the model using the independent variables
    predictions = model.predict(predictors)
    # Based on the predictions, compute the different metrics
    accuracy = accuracy_score(target, predictions)
    recall = recall_score(target, predictions)
    precision = precision_score(target, predictions)
    f1 = f1_score(target, predictions)
    # creating a dataframe to simplify its visualization
    df_model_performance = pd.DataFrame(
        {
            "Accuracy": accuracy,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    # returning the dataframe created
    return df_model_performance
# Defining a function to illustrate the confusion matrix
# of a decision tree in order to visualize the predictions.
def plot_confusion_matrix(model, predictors, target):
    """
    Function to plot or visualize the confusion matrix of a decision tree
    model: decision tree
    predictors: independent variables
    target: dependent variable
    """
    # predicting with the model using the independent variables
    predictions = model.predict(predictors)
    # creating a confusion matrix using the dependent variable
    # and the model predictions
    con_matrix = confusion_matrix(target, predictions)
    # formatting each cell as "count\npercentage"
    labels = np.asarray(
        [
            "{0:0.0f}".format(item) + "\n{0:.2%}".format(item / con_matrix.flatten().sum())
            for item in con_matrix.flatten()
        ]
    ).reshape(2, 2)
    # Setting the plot's size
    plt.figure(figsize=(6, 4))
    # plotting the matrix through a heatmap
    sns.heatmap(con_matrix, annot=labels, fmt="")
    # setting the x and y labels for the matrix
    plt.xlabel("Predicted Class")
    plt.ylabel("Actual Class")
    # displaying the plot
    plt.show()
Now, we shall build the first model. Whether this model will be the final model will be decided based on the evaluation metrics. After building the first model, we will create new models with different parameters, hyperparameters and other techniques to avoid overfitting and underfitting; those models will be in Model Performance Improvement.
# Creating the first model. Technically speaking,
# creating the decision tree instance
# Using the random state defined earlier.
model_1 = DecisionTreeClassifier(random_state=RS)
# Now that the instance is created,
# we will fit the tree to the training data
model_1.fit(X_train, y_train)
DecisionTreeClassifier(random_state=69)
Now that the model has been trained with the training data, we will observe the confusion matrix and the evaluation metrics for this model.
# Printing the confusion matrix and the metrics of the training data
plot_confusion_matrix(model_1, X_train, y_train)
model_performance_metrics(model_1, X_train, y_train)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
As we can see, the model perfectly identified the True Negatives (90.38%) and the True Positives, our class of interest, at 9.62% of customers. There are no errors, meaning that the evaluation metrics have reached their maximum value.
Now we will evaluate the same model with the test data.
# Printing the confusion matrix and the metrics of the test data
plot_confusion_matrix(model_1, X_test, y_test)
model_performance_metrics(model_1, X_test, y_test)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.971831 | 0.831579 | 0.868132 | 0.849462 |
With the test data, new errors in the predictions can be observed. Since performance on the training data was perfect while the test data shows errors, this is an example of overfitting; in other words, the model has memorized the training data and does not generalize well to unseen data. This is an issue we will attempt to solve during the improvements.
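One symptom of this overfitting is the sheer size of an unconstrained tree. The sketch below uses synthetic stand-in data; in the notebook, `model_1.get_depth()` and `model_1.get_n_leaves()` would show the same effect on our data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=500, random_state=0)

# An unconstrained tree keeps splitting until its leaves are pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
print("depth:", deep_tree.get_depth())
print("leaves:", deep_tree.get_n_leaves())
```

A tree that is allowed to grow until every leaf is pure effectively memorizes the training set, which is exactly what the pruning techniques later on are designed to prevent.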
Before we continue; as we will be printing the confusion matrix with the evaluation scores multiple times, a new function will be created to simplify this process.
# creating a function to simplify the execution of this code
def evaluate_model(model, predictors, target):
    plot_confusion_matrix(model, predictors, target)
    # as the performance metric is a dataframe,
    # we return it as well so it can be displayed easily
    return model_performance_metrics(model, predictors, target)
# Verifying this process
evaluate_model(model_1, X_test, y_test)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.971831 | 0.831579 | 0.868132 | 0.849462 |
The second model will be built with a new hyperparameter. During the creation of the model, we will set class_weight="balanced". This automatically adjusts the class weights to be inversely proportional to the class frequencies in the input data.
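For intuition, scikit-learn computes the balanced weights as n_samples / (n_classes * class_count); with a roughly 90/10 split like ours, the minority class receives nine times the weight of the majority. A sketch of the formula on counts mirroring our split:

```python
import numpy as np

# Class counts roughly matching our 90/10 training split
y_demo = np.array([0] * 900 + [1] * 100)

n_samples = len(y_demo)
counts = np.bincount(y_demo)  # [900, 100]
n_classes = len(counts)

# The formula behind class_weight="balanced"
weights = n_samples / (n_classes * counts)
print(weights)  # roughly [0.556, 5.0]
```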
# creating the second model with balanced class weight
model_2 = DecisionTreeClassifier(random_state=RS, class_weight="balanced")
# fitting the model to the training data
model_2.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=69)
# Evaluating the model
evaluate_model(model_2, X_train, y_train)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Now let's evaluate on the test data
evaluate_model(model_2, X_test, y_test)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.977867 | 0.831579 | 0.929412 | 0.877778 |
Observations:
Balancing the class weights improved the test Precision (0.93 vs 0.87) and F1 score (0.88 vs 0.85) compared to the default model, while Recall stayed the same. The model still fits the training data perfectly, so overfitting remains.
Now, we will apply what are known as pruning techniques to reduce overfitting.
Pre-pruning a decision tree limits its growth with hyperparameters that are set before training, much as we set class_weight="balanced" before fitting.
Because these constraints are established before the tree adapts to the training dataset, they reduce performance on the training data and thereby reduce overfitting. The added generality makes the tree more adaptable to unseen data, which should improve performance on the test data.
We will create a loop to search three of the most common hyperparameters for the values that improve the F1 Score, since we are not aiming to maximize Recall or Precision individually.
# Define the parameters of the tree to iterate
# The maximum depth of the tree,
# ranging from 2 to 10 inclusive, with a step of 1
max_depth_values = np.arange(2, 11, 1)
# The maximum number of leaf nodes
max_leaf_nodes_values = [20, 30, 40, 50, 75, 100]
# The minimum number of samples in a node to split it
min_sample_split_values = [20, 30, 50, 70, 100]
# Initialize variables to store the best model and its f1 performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterating over all combinations of hyperparameters
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_sample_split in min_sample_split_values:
            # Initialize a tree with the current hyperparameters in the loop
            estimator = DecisionTreeClassifier(
                # max depth hyperparameter
                max_depth=max_depth,
                # max leaf nodes hyperparameter
                max_leaf_nodes=max_leaf_nodes,
                # min samples split hyperparameter
                min_samples_split=min_sample_split,
                # setting the balanced class weight hyperparameter
                class_weight="balanced",
                # The Random State constant
                random_state=RS,
            )
            # With this estimator, we will fit the model to the training data
            estimator.fit(X_train, y_train)
            # Predicting with the training and test data
            y_train_prediction = estimator.predict(X_train)
            y_test_prediction = estimator.predict(X_test)
            # Calculate the f1 scores for the training and test sets
            train_f1_score = f1_score(y_train, y_train_prediction)
            test_f1_score = f1_score(y_test, y_test_prediction)
            # Calculate the absolute difference between training and test f1 scores
            score_diff = abs(train_f1_score - test_f1_score)
            # Update the best estimator and score
            # if the difference is smaller than the best difference
            # and the current test score is bigger than the best score
            if (score_diff < best_score_diff) and (test_f1_score > best_test_score):
                # updating the difference
                best_score_diff = score_diff
                # updating the score
                best_test_score = test_f1_score
                # updating the estimator
                best_estimator = estimator
# With the best estimator obtained, we will print its parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test F1 score: {best_test_score}")
Best parameters found:
Max depth: 5
Max leaf nodes: 20
Min samples split: 30
Best test F1 score: 0.8303571428571428
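As an aside, the same search could be performed with scikit-learn's GridSearchCV, which cross-validates each combination on the training data instead of scoring against the test set; a sketch on synthetic stand-in data (the grids mirror the ones above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=400, weights=[0.9], random_state=0)

# Same hyperparameter ranges as the manual loop above
param_grid = {
    "max_depth": list(range(2, 11)),
    "max_leaf_nodes": [20, 30, 40, 50, 75, 100],
    "min_samples_split": [20, 30, 50, 70, 100],
}

# 5-fold cross-validated search maximizing F1
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Cross-validation avoids tuning hyperparameters against the held-out test set, which our manual loop implicitly does.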
With the best parameters found from this iteration, we will create the instance from the best pre-pruned tree.
# creating an instance of the best model
# with the estimator created from the iteration
model_3 = best_estimator
# fitting the model to the training data
model_3.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=5, max_leaf_nodes=20,
                       min_samples_split=30, random_state=69)
# Let's evaluate this model with the training data
evaluate_model(model_3, X_train, y_train)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.960473 | 1.0 | 0.70872 | 0.829533 |
# Let's evaluate this model with test data
evaluate_model(model_3, X_test, y_test)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.961771 | 0.978947 | 0.72093 | 0.830357 |
Observe that the model is giving generalized results. This shows that the model is less overfitted.
Post pruning consists of modifying an already trained tree in an effort to reduce its size and complexity. This is mainly accomplished with its cost complexity parameter ccp_alpha.
Higher values of this parameter prune more nodes. To find the best tree, we will look for the point where the cost and the number of nodes pruned are balanced; in other words, just before the point where further pruning starts destroying value.
We will use DecisionTreeClassifier.cost_complexity_pruning_path to observe the effective alphas and the total impurity of the leaves. As alpha increases, more of the tree is pruned and the total impurity grows.
# let us create a tree to prune
dec_tree = DecisionTreeClassifier(random_state=RS, class_weight="balanced")
# obtaining the path with the train data
path = dec_tree.cost_complexity_pruning_path(X_train, y_train)
# obtaining the alphas and the impurities
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
# displaying the alphas and the impurities in a dataframe
pd.DataFrame(path).head(10)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000e+00 | -2.128723e-16 |
| 1 | 8.659121e-19 | -2.120064e-16 |
| 2 | 8.813748e-19 | -2.111250e-16 |
| 3 | 2.242094e-18 | -2.088829e-16 |
| 4 | 2.412184e-18 | -2.064707e-16 |
| 5 | 3.092543e-18 | -2.033782e-16 |
| 6 | 3.231708e-18 | -2.001465e-16 |
| 7 | 3.231708e-18 | -1.969148e-16 |
| 8 | 3.487612e-18 | -1.934272e-16 |
| 9 | 5.334637e-18 | -1.880925e-16 |
# From the dataframe, we will visualize using a plot
# creating the figure and the axis
# while setting the size
fig, ax = plt.subplots(figsize=(10,5))
# plotting the alphas and the impurities
ax.plot(
ccp_alphas[:-1],
impurities[:-1],
# the type of marker to use on the plot
marker="o",
# the drawstyle of the plot
drawstyle="steps-post"
)
# setting the label for the x axis
ax.set_xlabel("effective alpha")
# setting the label for the y axis
ax.set_ylabel("total impurity of leaves")
# setting the title of the plot
ax.set_title("Total Impurity vs effective alpha for training set")
# showing the plot
plt.show()
We will train decision trees using the effective alphas. The last of these is the value that prunes the whole tree down to its root; that tree is redundant to our analysis, so we will drop it.
# creating a list for all the trees to train
trees = []
# iterating over the alphas
for ccp_alpha in ccp_alphas:
    # Building a tree with the iterated value of alpha
    tree = DecisionTreeClassifier(
        random_state=RS,
        ccp_alpha=ccp_alpha,
        class_weight="balanced",
    )
    # training the tree with the training data
    tree.fit(X_train, y_train)
    # appending the tree built into the list
    trees.append(tree)
# To help illustrate the redundancy
print(
"Number of nodes in the last tree is: {}\n With ccp_alpha: {}".format(
trees[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 With ccp_alpha: 0.28401785511222305
Now, let us visualize the depth of the tree and the number of nodes according to each alpha.
# Dropping the last tree since it is redundant
trees = trees[:-1]
ccp_alphas = ccp_alphas[:-1]
# obtaining the number of nodes for each tree in the list trees
node_counts = [tree.tree_.node_count for tree in trees]
# obtaining the depth for each tree in the list trees
depth = [tree.tree_.max_depth for tree in trees]
# Plotting the nodes and depths
# setting the size, while getting the figure and the axes
fig, ax = plt.subplots(2, 1, figsize=(12,7))
# settings for the first plot
# plotting the alphas with the number of nodes
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
# setting the label of x axis
ax[0].set_xlabel("alpha")
# setting the label of y axis
ax[0].set_ylabel("number of nodes")
# setting the title of the first plot
ax[0].set_title("Number of nodes vs alpha")
# settings for the second plot
# plotting the alphas with the depth of the trees
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
# setting the label of x axis
ax[1].set_xlabel("alpha")
# setting the label of y axis
ax[1].set_ylabel("depth of tree")
# setting the title of the second plot
ax[1].set_title("Depth vs alpha")
# showing the plots
fig.tight_layout()
As we can see, the depth of the tree and the number of nodes decrease as alpha increases. Finally, we will observe how increasing alpha affects the F1 score of the trees.
# initialize an empty list to hold the training scores
train_f1_scores = []
# Iterate through the trees
for tree in trees:
    # predict labels using training data
    y_train_prediction = tree.predict(X_train)
    # calculate the F1 score for the training set predictions
    f1_train = f1_score(y_train, y_train_prediction)
    # Append the F1 score to the list of training F1 scores
    train_f1_scores.append(f1_train)
# Repeating the same procedure with the test data
# initialize an empty list to hold the test scores
test_f1_scores = []
# Iterate through the trees
for tree in trees:
    # predict labels using test data
    y_test_prediction = tree.predict(X_test)
    # calculate the F1 score for the test set predictions
    f1_test = f1_score(y_test, y_test_prediction)
    # Append the F1 score to the list of test F1 scores
    test_f1_scores.append(f1_test)
# Plotting the f1 scores with respect to alpha
fig, ax = plt.subplots(figsize=(15,5))
# setting the x label
ax.set_xlabel("Alpha")
# setting the y label
ax.set_ylabel("F1 Score")
# setting the title
ax.set_title("F1 Score vs Alpha")
# plotting the training f1 scores first
ax.plot(
ccp_alphas,
train_f1_scores,
marker="o",
drawstyle="steps-post",
label="train"
)
# plotting the test f1 scores afterwards
ax.plot(
ccp_alphas,
test_f1_scores,
marker="o",
drawstyle="steps-post",
label="test"
)
# adding a legend to the plot
ax.legend();
# creating the model where we get the highest test f1 score
# extracting the location of the highest f1 test score
index_best_model = np.argmax(test_f1_scores)
# Selecting the tree from the previous index
model_4 = trees[index_best_model]
# printing the tree
print(model_4)
DecisionTreeClassifier(ccp_alpha=0.000264322430759792, class_weight='balanced',
random_state=69)
With the best tree from the post-pruning process selected, let us evaluate it.
evaluate_model(model_4, X_train, y_train)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997231 | 1.0 | 0.97201 | 0.985806 |
evaluate_model(model_4, X_test, y_test)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.979879 | 0.884211 | 0.903226 | 0.893617 |
The results are comparable between the training and test data, indicating a generalized performance.
After creating various trees with different methods, we will evaluate and compare all of them in order to choose the best model. To do this, we will create a data frame holding their evaluation metrics.
# Performance comparison in training data
# extracting the value from the performance with our function
# and concatenating to a new dataframe
model_comparisons_train = pd.concat(
[
# evaluating all the models to extract the metrics
model_performance_metrics(model_1, X_train, y_train).T,
model_performance_metrics(model_2, X_train, y_train).T,
model_performance_metrics(model_3, X_train, y_train).T,
model_performance_metrics(model_4, X_train, y_train).T,
],
# stating whether the values are rows or columns (1 for columns)
axis=1
)
# creating the labels for the columns
model_comparisons_train.columns = [
"Default Model",
"Class_weight",
"Pre-Pruned Tree",
"Post_Pruned Tree"
]
# Performance comparison in testing data
# extracting the value from the performance with our function
# and concatenating to a new dataframe
model_comparisons_test = pd.concat(
[
# evaluating all the models to extract the metrics
model_performance_metrics(model_1, X_test, y_test).T,
model_performance_metrics(model_2, X_test, y_test).T,
model_performance_metrics(model_3, X_test, y_test).T,
model_performance_metrics(model_4, X_test, y_test).T,
],
# stating whether the values are rows or columns (1 for columns)
axis=1
)
# creating the labels for the columns
model_comparisons_test.columns = [
"Default Model",
"Class_weight",
"Pre-Pruned Tree",
"Post_Pruned Tree"
]
Now, let us visualize them.
print("Training Performance comparison")
model_comparisons_train
Training Performance comparison
| Default Model | Class_weight | Pre-Pruned Tree | Post_Pruned Tree | |
|---|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.960473 | 0.997231 |
| Recall | 1.0 | 1.0 | 1.000000 | 1.000000 |
| Precision | 1.0 | 1.0 | 0.708720 | 0.972010 |
| F1 | 1.0 | 1.0 | 0.829533 | 0.985806 |
print("Testing Performance comparison")
model_comparisons_test
Testing Performance comparison
| Default Model | Class_weight | Pre-Pruned Tree | Post_Pruned Tree | |
|---|---|---|---|---|
| Accuracy | 0.971831 | 0.977867 | 0.961771 | 0.979879 |
| Recall | 0.831579 | 0.831579 | 0.978947 | 0.884211 |
| Precision | 0.868132 | 0.929412 | 0.720930 | 0.903226 |
| F1 | 0.849462 | 0.877778 | 0.830357 | 0.893617 |
Observations:
With all the trees compared and analysed, we choose the Post_Pruned Tree as our final model, as it shows good scores on both the training and test data.
With our tree selected, let us observe its characteristics.
# Setting the size of the plot figure
plt.figure(figsize=(20, 10))
# Since we used the variable name `tree` as a loop iterator,
# it may have overwritten the imported module.
# Let's reimport it under a different name.
from sklearn import tree as _tree
# Extracting the tree
out = _tree.plot_tree(
model_4,
# We take the feature names from the independent variables
feature_names= list(X_train.columns),
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Setting the arrows if they are not visible
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
# Showing the plot
plt.show()
As we can see, it is still a large tree, and inspecting every node would not be practical. Let's look instead at the important features of the model.
# extracting the important features
importances = model_4.feature_importances_
indices = np.argsort(importances)
# setting the size of the plot
plt.figure(figsize=(12, 12))
# setting the title
plt.title("Feature importances")
# plotting the feature importances
plt.barh(
range(len(indices)),
importances[indices],
color="violet",
align="center"
)
# setting the tick labels
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
# setting the x labels
plt.xlabel("Relative Importance")
# showing the plot
plt.show()
As we saw in our EDA, Income was highly positively correlated with Personal_Loan, so it was expected to be important in the model. However, Family was not as highly correlated with Personal_Loan, showing that correlation alone does not determine the model.
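To complement the importance bars, the splits of a fitted tree can be printed as readable if/else rules with sklearn's export_text. The sketch below uses a small toy tree; in the notebook, `export_text(model_4, feature_names=list(X_train.columns))` would print the actual rules:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-ins for model_4 and X_train
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Prints the tree as nested if/else rules, one line per split or leaf
rules = export_text(demo_tree, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```

Reading the top rules is often the quickest way to translate the model into actionable targeting criteria for the marketing team.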
The model built can be used to predict if a customer will reject or accept a loan.
People with larger families (more than 2 members in our analysis) tend to have lower income and lower average credit card expenditure.
Higher income also goes hand in hand with higher average credit card expenses.
All of the sampled data comes from CA, so the model is likely fitted mainly to that particular state; more diverse regional data would be needed to provide further and broader insights.
The education of an individual shows more importance in their decision than their years of experience. The income of the individual tends to decrease as they hold higher levels of education.
People with a Certificate of Deposit tend to use online banking facilities more. They also tend to hold credit cards issued by other banks.
A large share of customers have no mortgage at all.